Background:

Necrotizing enterocolitis (NEC) is a gastrointestinal disease that is the leading cause of premature infant death in the NICU, with a mortality rate of 20–50%. NEC is difficult to diagnose due to late stage signs such as abdominal distension, blood in stool, microbial imbalance, and gas in intestinal walls (pneumatosis intestinalis) which quickly progresses to local and systemic inflammation, multi-organ failure, and death. Even if the premature infant survives NEC, they may lead a decreased quality of life due to permanent bowel issues and neurodevelopmental delays from NEC exposure. The NEC profile is characterized by up-regulation of pro-inflammatory markers such as IL-6, IL-8, IL-1B, and TNF alpha.

Hypothesis:

We hypothesize that the development and severity of NEC can be attenuated by through inhibition of pro-inflammation pathways.

Dataset Information

SRA: SRP272342
file type: fastq.gz
Summary: Comparison of intestinal epithelial cell tissue of breast fed (control) and formula fed (NEC model) mice at P4
Importing Libraries
library(tximport)
library(readr)
library(tximportData)
library(DESeq2) #nra-seq analysis 
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
    parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min


Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
    colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads,
    colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks,
    colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans,
    colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs,
    rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2,
    rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians
library(pheatmap)
library(vsn)
library(RColorBrewer)
library(gplots)

Attaching package: 'gplots'

The following object is masked from 'package:IRanges':

    space

The following object is masked from 'package:S4Vectors':

    space

The following object is masked from 'package:stats':

    lowess
library(ggplot2)
library(genefilter)

Attaching package: 'genefilter'

The following objects are masked from 'package:MatrixGenerics':

    rowSds, rowVars

The following objects are masked from 'package:matrixStats':

    rowSds, rowVars

The following object is masked from 'package:readr':

    spec
library(AnnotationDbi)
library(org.Mm.eg.db)
library(dplyr)

Attaching package: 'dplyr'

The following object is masked from 'package:AnnotationDbi':

    select

The following object is masked from 'package:Biobase':

    combine

The following object is masked from 'package:matrixStats':

    count

The following objects are masked from 'package:GenomicRanges':

    intersect, setdiff, union

The following object is masked from 'package:GenomeInfoDb':

    intersect

The following objects are masked from 'package:IRanges':

    collapse, desc, intersect, setdiff, slice, union

The following objects are masked from 'package:S4Vectors':

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from 'package:BiocGenerics':

    combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(DiagrammeR)
library(GenomicFeatures)
library(apeglm)
library(gage)
library(gageData)
library(pathview)

##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
formally cite the original Pathview paper (not just mention it) in publications
or products. For details, do citation("pathview") within R.

The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
license agreement (details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
library(biomaRt)
library(clusterProfiler)
clusterProfiler v4.0.0  For help: https://guangchuangyu.github.io/software/clusterProfiler

If you use clusterProfiler in published research, please cite:
Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

Attaching package: 'clusterProfiler'

The following object is masked from 'package:biomaRt':

    select

The following object is masked from 'package:AnnotationDbi':

    select

The following object is masked from 'package:IRanges':

    slice

The following object is masked from 'package:S4Vectors':

    rename

The following object is masked from 'package:stats':

    filter

Design

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Obtain data'
      [2]: 'Process and clean raw data: fastq to readCounts'
      [3]: 'Observe sample quality'
      [4]: 'Data Visualization'
      [5]: 'Pathway Analysis' 
      ")

Obtain data

Downloading raw, zipped files from NCBI GEO repository

#process ()
#{
#        fasterq-dump $SRR
#        mv ${SRR}_1.fastq ${sample}_1.fastq; mv ${SRR}_2.fastq ${sample}_2.fastq
#        gzip ${sample}_1.fastq  ${sample}_2.fastq
#}

#for GSM in $(grep -i -v i3c GSMtable.txt | cut -f1)
#do
#        SRR=$(grep $GSM runInfo.txt | cut -f1)
#        sample=$(grep $GSM GSMtable.txt | cut -f2 | awk '{gsub(" ", ""); print}')
#        #echo $sample
#        process &
#done
#wait
#echo done 

Importing raw files into alignment tool (Salmon)

#for fn in sample.e144_14_ff sample.e144_17_ff sample.e144_19_ff sample.e144_1_bf sam>
#do  
#echo "Processing sample $fn"
#salmon quant -i /home/aaltamirano/Documents/nsbb552/genome_folder/alias/mm10/salmon_>
#        -1 ${fn}_1.fastq \
#        -2 ${fn}_2.fastq \
#        -p 24 --validateMappings -o quants/Salmon/${fn}
#done 
#echo "done" 

Setting up files for DESeq2 Analysis

# Setting working directory
dir <- "/home/aaltamirano/Documents/nsbb552/quants/Salmon"
list.files(dir)
 [1] "0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf" "mm10.refGene.gtf"                                    
 [3] "sample.e144_1_bf"                                     "sample.e144_14_ff"                                   
 [5] "sample.e144_17_ff"                                    "sample.e144_19_ff"                                   
 [7] "sample.e145_15_ff"                                    "sample.e145_2_bf"                                    
 [9] "sample.e145_3_bf"                                     "sample.e145_4_bf"                                    
[11] "samples.txt"                                         
# Make a gene ID x Transcript name data frame from reference genome (tx2gene)
# Order of columns matters 
txdb <- makeTxDbFromGFF("/home/aaltamirano/Documents/nsbb552/quants/Salmon/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.OK
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID")
'select()' returned 1:1 mapping between keys and columns
head(tx2gene)
#Make a samples.txt file with sample IDs - should match the sample IDs used for alignment/mapping
#Assign samples.txt to the samples variable
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples$condition <- factor(rep(c("A","B"),each=4)) #Adds a condition column
samples
# A=control B=NEC


files <- file.path(dir, samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
#files
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 
transcripts missing from tx2gene: 176
summarizing abundance
summarizing counts
summarizing length
#names(txi.salmon)
head(txi.salmon$counts)
                   sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001              592              446              808          816.000          1076.000
ENSMUSG00000000003                0                0                0            0.000             0.000
ENSMUSG00000000028               41               25               72           31.000            55.000
ENSMUSG00000000037               10                9               15           21.001            24.000
ENSMUSG00000000049                2                1                3            5.000             4.000
ENSMUSG00000000056              239              157              345          253.000           292.089
                   sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001               771          1071.371               803
ENSMUSG00000000003                 0             0.000                 0
ENSMUSG00000000028                69            75.000                68
ENSMUSG00000000037                17            21.000                35
ENSMUSG00000000049                 2             3.000                 1
ENSMUSG00000000056               207           339.000               294
#all(file.exists(files))
#head(files)
#file.exists(files)
data.frame(thefiles = files, doihave = file.exists(files))

Loading deseq2

dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
using counts and average transcript lengths from tximport
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds$condition <- factor(dds$condition, levels = c("A","B"))

Processing data

dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
dds
class: DESeqDataSet 
dim: 18036 8 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(18036): ENSMUSG00000000001 ENSMUSG00000000028 ... ENSMUSG00000118474 ENSMUSG00000118486
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(8): sample.e145_2_bf sample.e145_3_bf ... sample.e144_19_ff sample.e145_15_ff
colData names(2): sample condition
res <- results(dds)
#res <- res[res$log2FoldChange >= 1, ]
summary(res)

out of 18036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 824, 4.6%
LFC < 0 (down)     : 1057, 5.9%
outliers [1]       : 11, 0.061%
low counts [2]     : 3144, 17%
(mean count < 8)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Clean data

## log fold change shrinkage for visualization and ranking
resultsNames(dds)
[1] "Intercept"        "condition_B_vs_A"
resLFC <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
log2 fold change (MAP): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 18036 rows and 5 columns
                    baseMean log2FoldChange     lfcSE    pvalue      padj
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000000001 779.95385     0.05630551  0.138118  0.587677  0.801338
ENSMUSG00000000028  51.85717     0.11377348  0.187239  0.294300  0.574282
ENSMUSG00000000037  18.06297     0.05102431  0.192146  0.539494  0.773729
ENSMUSG00000000049   3.04633    -0.03246901  0.205843  0.389853        NA
ENSMUSG00000000056 262.93131     0.00273615  0.148849  0.980164  0.991664
...                      ...            ...       ...       ...       ...
ENSMUSG00000118434  17.96429     0.00680990  0.188574  0.930129  0.971692
ENSMUSG00000118458  31.01807    -0.00784403  0.187435  0.917789  0.967527
ENSMUSG00000118462   1.56485    -0.02126945  0.205485  0.457317        NA
ENSMUSG00000118474   2.35901     0.01396206  0.203058  0.710999        NA
ENSMUSG00000118486   1.28008     0.01290908  0.204958  0.599901        NA
## Organizing results by smallest P-value
resOrdered <- res[order(res$pvalue),]
resOrdered
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 18036 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue        padj
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
ENSMUSG00000030017  310.8436        4.59658  0.259441  17.71726 3.08587e-70 4.59209e-66
ENSMUSG00000071356 1057.5221        2.79887  0.248813  11.24886 2.34592e-29 1.74548e-25
ENSMUSG00000069917 1299.4493       -2.90978  0.271248 -10.72741 7.56910e-27 3.75452e-23
ENSMUSG00000073940   99.9417       -3.66798  0.364057 -10.07528 7.10605e-24 2.64363e-20
ENSMUSG00000037033  552.1574        1.77785  0.178946   9.93511 2.92839e-23 8.71547e-20
...                      ...            ...       ...       ...         ...         ...
ENSMUSG00000084319  53.70694       -1.19994   3.22936 -0.371572          NA          NA
ENSMUSG00000084383  16.35271      -22.18544   3.38615 -6.551823          NA          NA
ENSMUSG00000094248   5.23463        5.62342   3.38978  1.658933          NA          NA
ENSMUSG00000096592   7.38043       -6.53535   3.38952 -1.928109          NA          NA
ENSMUSG00000114942  11.80172       20.08815   3.38622  5.932327          NA          NA
## How many adjusted p-values were less than 0.01?
sum(res$padj < 0.1, na.rm=TRUE)
[1] 1881
## Changing p-value to 0.05 
res05 <- results(dds, alpha=0.05)
summary(res05)

out of 18036 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 515, 2.9%
LFC < 0 (down)     : 706, 3.9%
outliers [1]       : 11, 0.061%
low counts [2]     : 3494, 19%
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
## How many adjusted p-values were less htan 0.05?
sum(res05$padj < 0.05, na.rm=TRUE)
[1] 1221

Observe quality of data

plotMA(res, ylim=c(-2,2))

plotMA(resLFC, ylim=c(-2,2))

#checked 
plotDispEsts(dds, main="Dispersion plot")

mcols(resLFC)$description
[1] "mean of normalized counts for all samples" "log2 fold change (MAP): condition B vs A" 
[3] "posterior SD: condition B vs A"            "Wald test p-value: condition B vs A"      
[5] "BH adjusted p-values"                     
ntd <- normTransform(dds)
rld <- rlogTransformation(dds)
head(assay(rld))
                   sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001         9.586836         9.432189         9.459377         9.772670          9.740088
ENSMUSG00000000028         5.699223         5.549135         5.750399         5.518271          5.665467
ENSMUSG00000000037         4.113784         4.070128         4.068480         4.285997          4.162325
ENSMUSG00000000049         1.487834         1.424627         1.442311         1.594476          1.463742
ENSMUSG00000000056         8.274156         7.880116         7.883624         8.022380          8.202945
ENSMUSG00000000058         6.058225         6.075010         6.458566         6.145500          6.529041
                   sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001          9.678711          9.676637          9.394747
ENSMUSG00000000028          5.884028          5.717651          5.699630
ENSMUSG00000000037          4.180546          4.124824          4.298698
ENSMUSG00000000049          1.478057          1.441119          1.402578
ENSMUSG00000000056          7.840829          8.116099          7.916973
ENSMUSG00000000058          5.953610          6.106999          6.163559
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
                   sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001         9.717893         9.482189         9.525596         9.995258          9.946287
ENSMUSG00000000028         6.871501         6.655288         6.936889         6.619648          6.826289
ENSMUSG00000000037         6.148628         6.074012         6.087192         6.415225          6.230451
                   sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001          9.855570          9.852053          9.427373
ENSMUSG00000000028          7.116053          6.894127          6.871023
ENSMUSG00000000037          6.258165          6.175734          6.419137
meanSdPlot(assay(ntd))

meanSdPlot(assay(vsd))

meanSdPlot(assay(rld))


#Shows the effect of the transformation, i
ddsESF <- estimateSizeFactors(dds)
using 'avgTxLength' from assays(dds), correcting for library size
df1 <- data.frame(log2(counts(ddsESF, normalized=TRUE)[, 1:2] + 1))
df1$transformation <- "log2(x + 1)"
df2 <- data.frame(assay(rld)[, 1:2])
df2$transformation <- "rld"
df3 <- data.frame(assay(vsd)[, 1:2])
df3$transformation <- "vsd"
df <- rbind(df1, df2, df3)
colnames(df)[1:2] <- c("x", "y")
head(df)
table(df$transformation)

log2(x + 1)         rld         vsd 
      18036       18036       18036 
options(repr.plot.width=6, repr.plot.height=2.5)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
#rld graph compresses differences for the low count genes > excludes genes with low counts

par(mfrow=c(1,3))

boxplot(log2(assay(ddsESF)+1), las=2, main="log2(x+1)")
boxplot(assay(rld), las=2, main="rld")
boxplot(assay(vsd), las=2, main="vsd")


select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)["condition"])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
Found more than one class "unit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'
Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'
Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'

#
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(samples$condition))])
[1] "#1B9E77" "#D95F02"
sampleDists <- as.matrix(dist(t(assay(rld))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "blue", "white"),
          ColSideColors=mycols[samples$condition],
          RowSideColors=mycols[samples$condition],
          margin=c(10, 10), main="Sample Distance Matrix")

sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, rld$sample, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

#PCA Alternative code
#DESeq2::plotPCA(rld, intgroup="condition")
pcaData <- plotPCA(rld, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

table(resLFC$padj<0.05)

FALSE  TRUE 
13675  1206 
res <- resLFC[order(resLFC$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)

Data Visualization

# Volcano plot
with(resLFC, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot",ylim=c(0,10), xlim=c(-2,2)))
# Add colored points: red if padj<0.05, orange of log2FC>1, green if both)
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))

#Assigning the top 20 significant genes 
topVargenes20 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
topVargenes500 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 500)


#Making a heatmap of the top 20 significant genes 
mat <- assay(rld)[topVargenes20, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, annotation_col = anno)



mat <- assay(rld)[topVargenes500, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, scale = "row", show_rownames = F,
         clustering_method = 'average',  annotation_col = anno, cutree_cols = 3)

Pathway Analysis

#Creating datasets with KEGG gene sets to test
kegg_mouse <- kegg.gsets(species = "mouse", id.type = "kegg")
names(kegg_mouse)
[1] "kg.sets"    "sigmet.idx" "sig.idx"    "met.idx"    "dise.idx"  
kegg.gs <- kegg_mouse$kg.sets[kegg_mouse$sigmet.idx]
res$symbol = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")
'select()' returned 1:many mapping between keys and columns
res$name =   mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="GENENAME",
                     keytype="ENSEMBL",
                     multiVals="first")
'select()' returned 1:many mapping between keys and columns
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
#Running GAGE

keggres = gage(foldchanges, gsets=kegg.gs, same.dir = T)
names(keggres)
[1] "greater" "less"    "stats"  
lapply(keggres, head) #Up-regulated and down-regulated pathways
$greater
                                               p.geomean stat.mean       p.val     q.val set.size        exp1
mmu03013 RNA transport                       0.001703931  2.952683 0.001703931 0.2109217      160 0.001703931
mmu04610 Complement and coagulation cascades 0.001787472  2.969962 0.001787472 0.2109217       71 0.001787472
mmu03010 Ribosome                            0.004307706  2.658194 0.004307706 0.3388729      134 0.004307706
mmu04621 NOD-like receptor signaling pathway 0.007573771  2.443348 0.007573771 0.4468525      172 0.007573771
mmu03040 Spliceosome                         0.009753212  2.354265 0.009753212 0.4603516      128 0.009753212
mmu04380 Osteoclast differentiation          0.015603079  2.171236 0.015603079 0.6137211      116 0.015603079

$less
                                          p.geomean stat.mean        p.val       q.val set.size         exp1
mmu04142 Lysosome                      5.775434e-06 -4.484937 5.775434e-06 0.001363002      121 5.775434e-06
mmu00100 Steroid biosynthesis          1.848401e-03 -3.148434 1.848401e-03 0.138864576       18 1.848401e-03
mmu01212 Fatty acid metabolism         2.241459e-03 -2.900056 2.241459e-03 0.138864576       59 2.241459e-03
mmu00071 Fatty acid degradation        2.353637e-03 -2.911957 2.353637e-03 0.138864576       42 2.353637e-03
mmu00531 Glycosaminoglycan degradation 4.268060e-03 -2.801355 4.268060e-03 0.162815564       18 4.268060e-03
mmu01240 Biosynthesis of cofactors     4.637380e-03 -2.619927 4.637380e-03 0.162815564      141 4.637380e-03

$stats
                                             stat.mean     exp1
mmu03013 RNA transport                        2.952683 2.952683
mmu04610 Complement and coagulation cascades  2.969962 2.969962
mmu03010 Ribosome                             2.658194 2.658194
mmu04621 NOD-like receptor signaling pathway  2.443348 2.443348
mmu03040 Spliceosome                          2.354265 2.354265
mmu04380 Osteoclast differentiation           2.171236 2.171236
#head(keggres$greater) #Up-regulated pathways
#head(keggres$less) # Down-regulated pathways

# Explore the top 20 up-regulated pathways and KEGG IDs
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% 
  tbl_df() %>% 
  filter(row_number()<=20) %>% 
  .$id %>% 
  as.character()
`tbl_df()` was deprecated in dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
keggrespathways
 [1] "mmu03013 RNA transport"                                "mmu04610 Complement and coagulation cascades"         
 [3] "mmu03010 Ribosome"                                     "mmu04621 NOD-like receptor signaling pathway"         
 [5] "mmu03040 Spliceosome"                                  "mmu04380 Osteoclast differentiation"                  
 [7] "mmu04924 Renin secretion"                              "mmu04662 B cell receptor signaling pathway"           
 [9] "mmu04657 IL-17 signaling pathway"                      "mmu03008 Ribosome biogenesis in eukaryotes"           
[11] "mmu00970 Aminoacyl-tRNA biosynthesis"                  "mmu04670 Leukocyte transendothelial migration"        
[13] "mmu04261 Adrenergic signaling in cardiomyocytes"       "mmu04970 Salivary secretion"                          
[15] "mmu04972 Pancreatic secretion"                         "mmu04672 Intestinal immune network for IgA production"
[17] "mmu04640 Hematopoietic cell lineage"                   "mmu04915 Estrogen signaling pathway"                  
[19] "mmu04024 cAMP signaling pathway"                       "mmu04668 TNF signaling pathway"                       
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
 [1] "mmu03013" "mmu04610" "mmu03010" "mmu04621" "mmu03040" "mmu04380" "mmu04924" "mmu04662" "mmu04657" "mmu03008"
[11] "mmu00970" "mmu04670" "mmu04261" "mmu04970" "mmu04972" "mmu04672" "mmu04640" "mmu04915" "mmu04024" "mmu04668"
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse", new.signature=FALSE)

#plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse"))
'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /home/aaltamirano/Documents/nsbb552
Info: Writing image file mmu03013.pathview.png
# Open inflammation-associated pathway .png files
feh ~/Documents/nsbb552/mmu04662.pathview.png #"mmu04662 B cell receptor signaling pathway"
feh ~/Documents/nsbb552/mmu04657.pathview.png #"mmu04657 IL-17 signaling pathway"
feh ~/Documents/nsbb552/mmu04672.pathview.png #"mmu04672 Intestinal immune network for IgA production"
feh ~/Documents/nsbb552/mmu04668.pathview.png #"mmu04668 TNF signaling pathway"  

Discussion

After processing and cleaning the dataset, there is some clear differential expression between the breast fed (Condition A: Control) and the formula fed (Condition B: NEC) groups. This difference is most clearly observed when comparing samples 2,3,4 and 14,15,17, and 19. The top 20 most differentiated genes didn’t have a strong association with inflammation pathways. Aditionally, downstream pathway analyses of p vale<0.05 were to stringent, so p-value<0.1 was used. In conlcusion, this dataset does show differntial expression of genes in pro-inflammation pathways. Further pathways analyses with IPA and possible gene enrichment would be useful in pursuing these connections.
## Doesn't make biological sense 
# Create background dataset for hypergeometric testing using all genes tested for significance in the results                  
all_genes <- as.character(rownames(res)) 
# Extract significant results 
signif_res <- res[res$padj < 0.07 & !is.na(res$padj), ] 
signif_genes <- as.character(rownames(signif_res)) 

ego <- enrichGO(gene = signif_genes, universe = all_genes, keyType = "ENSEMBL", OrgDb = org.Mm.eg.db, ont = "BP",pAdjustMethod = "BH", qvalueCutoff = 0.07, readable = TRUE) 
# Output results from GO analysis to a table 
cluster_summary <- data.frame(ego) 
#Visualizing 
dotplot(ego, showCategory=50)

#emapplot(ego, showCategory=50)

# To color genes by log2 fold changes 
signif_res_lFC <- signif_res$log2FoldChange 
cnetplot(ego, categorySize="pvalue", showCategory = 5, foldChange= signif_res_lFC, vertex.label.font=6)

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] clusterProfiler_4.0.0       biomaRt_2.48.0              pathview_1.32.0             gageData_2.30.0            
 [5] gage_2.42.0                 apeglm_1.14.0               GenomicFeatures_1.44.0      DiagrammeR_1.0.6.1         
 [9] dplyr_1.0.6                 org.Mm.eg.db_3.13.0         AnnotationDbi_1.54.1        genefilter_1.74.0          
[13] ggplot2_3.3.3               gplots_3.1.1                RColorBrewer_1.1-2          vsn_3.60.0                 
[17] pheatmap_1.0.12             DESeq2_1.32.0               SummarizedExperiment_1.22.0 Biobase_2.52.0             
[21] MatrixGenerics_1.4.0        matrixStats_0.59.0          GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
[25] IRanges_2.26.0              S4Vectors_0.30.0            BiocGenerics_0.38.0         tximportData_1.20.0        
[29] readr_1.4.0                 tximport_1.20.0            

loaded via a namespace (and not attached):
  [1] utf8_1.2.1               tidyselect_1.1.0         RSQLite_2.2.7            htmlwidgets_1.5.3       
  [5] grid_4.1.0               BiocParallel_1.26.0      scatterpie_0.1.6         munsell_0.5.0           
  [9] preprocessCore_1.54.0    withr_2.4.2              colorspace_2.0-1         GOSemSim_2.18.0         
 [13] filelock_1.0.2           OrganismDbi_1.34.0       knitr_1.33               rstudioapi_0.13         
 [17] DOSE_3.18.0              labeling_0.4.2           KEGGgraph_1.52.0         bbmle_1.0.23.1          
 [21] GenomeInfoDbData_1.2.6   polyclip_1.10-0          bit64_4.0.5              farver_2.1.0            
 [25] downloader_0.4           coda_0.19-4              vctrs_0.3.8              treeio_1.16.1           
 [29] generics_0.1.0           xfun_0.23                biovizBase_1.40.0        BiocFileCache_2.0.0     
 [33] R6_2.5.0                 graphlayouts_0.7.1       locfit_1.5-9.4           AnnotationFilter_1.16.0 
 [37] bitops_1.0-7             cachem_1.0.5             reshape_0.8.8            fgsea_1.18.0            
 [41] DelayedArray_0.18.0      assertthat_0.2.1         BiocIO_1.2.0             scales_1.1.1            
 [45] ggraph_2.0.5             nnet_7.3-15              enrichplot_1.12.0        gtable_0.3.0            
 [49] affy_1.70.0              ggbio_1.40.0             tidygraph_1.2.0          ensembldb_2.16.0        
 [53] rlang_0.4.11             splines_4.1.0            rtracklayer_1.52.0       lazyeval_0.2.2          
 [57] hexbin_1.28.2            dichromat_2.0-0          checkmate_2.0.0          BiocManager_1.30.15     
 [61] yaml_2.2.1               reshape2_1.4.4           backports_1.2.1          qvalue_2.24.0           
 [65] Hmisc_4.5-0              RBGL_1.68.0              tools_4.1.0              affyio_1.62.0           
 [69] ellipsis_0.3.2           Rcpp_1.0.6               plyr_1.8.6               visNetwork_2.0.9        
 [73] base64enc_0.1-3          progress_1.2.2           zlibbioc_1.38.0          purrr_0.3.4             
 [77] RCurl_1.98-1.3           prettyunits_1.1.1        rpart_4.1-15             viridis_0.6.1           
 [81] cowplot_1.1.1            ggrepel_0.9.1            cluster_2.1.1            magrittr_2.0.1          
 [85] data.table_1.13.6        DO.db_2.9                mvtnorm_1.1-2            ggnewscale_0.4.5        
 [89] ProtGenerics_1.24.0      evaluate_0.14            hms_1.0.0                patchwork_1.1.1         
 [93] xtable_1.8-4             XML_3.99-0.6             emdbook_1.3.12           jpeg_0.1-8.1            
 [97] gridExtra_2.3            bdsmatrix_1.3-4          compiler_4.1.0           tibble_3.1.2            
[101] KernSmooth_2.23-18       crayon_1.4.1             shadowtext_0.0.8         htmltools_0.5.1.1       
[105] Formula_1.2-4            tidyr_1.1.2              geneplotter_1.70.0       aplot_0.0.6             
[109] DBI_1.1.1                tweenr_1.0.2             dbplyr_2.1.0             MASS_7.3-53.1           
[113] rappdirs_0.3.3           Matrix_1.3-2             cli_2.5.0                igraph_1.2.6            
[117] pkgconfig_2.0.3          rvcheck_0.1.8            GenomicAlignments_1.28.0 numDeriv_2016.8-1.1     
[121] foreign_0.8-81           ggtree_3.0.2             annotate_1.70.0          XVector_0.32.0          
[125] stringr_1.4.0            VariantAnnotation_1.38.0 digest_0.6.27            graph_1.70.0            
[129] Biostrings_2.60.1        rmarkdown_2.6            fastmatch_1.1-0          tidytree_0.3.4          
[133] htmlTable_2.2.1          restfulr_0.0.13          curl_4.3.1               gtools_3.9.2            
[137] Rsamtools_2.8.0          rjson_0.2.20             lifecycle_1.0.0          nlme_3.1-152            
[141] jsonlite_1.7.2           limma_3.48.0             viridisLite_0.4.0        BSgenome_1.60.0         
[145] fansi_0.5.0              pillar_1.6.1             lattice_0.20-41          GGally_2.1.1            
[149] KEGGREST_1.32.0          fastmap_1.1.0            httr_1.4.2               survival_3.2-7          
[153] GO.db_3.13.0             glue_1.4.2               png_0.1-7                Rgraphviz_2.36.0        
[157] bit_4.0.4                ggforce_0.3.3            stringi_1.5.3            blob_1.2.1              
[161] org.Hs.eg.db_3.13.0      caTools_1.18.2           latticeExtra_0.6-29      memoise_2.0.0           
[165] ape_5.5                 

Relevant Documentation

  1. rnaseqGene: end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages (http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html)
browseVignettes("rnaseqGene")
  1. DESeq2: Differential gene expression analysis based on the negative binomial distribution (https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
browseVignettes("DESeq2")
  1. Salmon: A tool for quantifying the expression of transcripts using RNA-seq data (https://combine-lab.github.io/salmon/getting_started/#after-quantification)
  • Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.
  1. apeglm
---
title: "NSBB 552 Final Project"
output: html_notebook
---

## Background:

##### Necrotizing enterocolitis (NEC) is a gastrointestinal disease that is the leading cause of premature infant death in the NICU, with a mortality rate of 20--50%. NEC is difficult to diagnose due to late stage signs such as abdominal distension, blood in stool, microbial imbalance, and gas in intestinal walls (pneumatosis intestinalis) which quickly progresses to local and systemic inflammation, multi-organ failure, and death. Even if the premature infant survives NEC, they may lead a decreased quality of life due to permanent bowel issues and neurodevelopmental delays from NEC exposure. The NEC profile is characterized by up-regulation of pro-inflammatory markers such as IL-6, IL-8, IL-1B, and TNF alpha.

## Hypothesis:

##### We hypothesize that the development and severity of NEC can be attenuated by through inhibition of pro-inflammation pathways.

### Dataset Information

##### link: <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154617>

##### SRA: SRP272342

##### file type: fastq.gz

##### Summary: Comparison of intestinal epithelial cell tissue of breast fed (control) and formula fed (NEC model) mice at P4

##### Importing Libraries

```{r}
library(tximport)
library(readr)
library(tximportData)
library(DESeq2) #nra-seq analysis 
library(pheatmap)
library(vsn)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(genefilter)
library(AnnotationDbi)
library(org.Mm.eg.db)
library(dplyr)
library(DiagrammeR)
library(GenomicFeatures)
library(apeglm)
library(gage)
library(gageData)
library(pathview)
library(biomaRt)
library(clusterProfiler)
```

### Design

```{r}
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Obtain data'
      [2]: 'Process and clean raw data: fastq to readCounts'
      [3]: 'Observe sample quality'
      [4]: 'Data Visualization'
      [5]: 'Pathway Analysis' 
      ")
```

# **Obtain data**

### Downloading raw, zipped files from NCBI GEO repository

```{bash}
#process ()
#{
#        fasterq-dump $SRR
#        mv ${SRR}_1.fastq ${sample}_1.fastq; mv ${SRR}_2.fastq ${sample}_2.fastq
#        gzip ${sample}_1.fastq  ${sample}_2.fastq
#}

#for GSM in $(grep -i -v i3c GSMtable.txt | cut -f1)
#do
#        SRR=$(grep $GSM runInfo.txt | cut -f1)
#        sample=$(grep $GSM GSMtable.txt | cut -f2 | awk '{gsub(" ", ""); print}')
#        #echo $sample
#        process &
#done
#wait
#echo done 

```

### Importing raw files into alignment tool (Salmon)

```{bash}
#for fn in sample.e144_14_ff sample.e144_17_ff sample.e144_19_ff sample.e144_1_bf sam>
#do  
#echo "Processing sample $fn"
#salmon quant -i /home/aaltamirano/Documents/nsbb552/genome_folder/alias/mm10/salmon_>
#        -1 ${fn}_1.fastq \
#        -2 ${fn}_2.fastq \
#        -p 24 --validateMappings -o quants/Salmon/${fn}
#done 
#echo "done" 
```

### Setting up files for DESeq2 Analysis

```{r}
# Setting working directory
dir <- "/home/aaltamirano/Documents/nsbb552/quants/Salmon"
list.files(dir)
```

```{r}
# Make a gene ID x Transcript name data frame from reference genome (tx2gene)
# Order of columns matters 
txdb <- makeTxDbFromGFF("/home/aaltamirano/Documents/nsbb552/quants/Salmon/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID")
head(tx2gene)
```

```{r}
#Make a samples.txt file with sample IDs - should match the sample IDs used for alignment/mapping
#Assign samples.txt to the samples variable
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples$condition <- factor(rep(c("A","B"),each=4)) #Adds a condition column
samples
# A=control B=NEC


files <- file.path(dir, samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
#files
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
#names(txi.salmon)
head(txi.salmon$counts)

```

```{r}
#all(file.exists(files))
#head(files)
#file.exists(files)
data.frame(thefiles = files, doihave = file.exists(files))
```

### Loading deseq2

```{r}
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
```

```{r}
dds$condition <- factor(dds$condition, levels = c("A","B"))
```

# **Processing data**

```{r}
dds <- DESeq(dds)
dds
res <- results(dds)
#res <- res[res$log2FoldChange >= 1, ]
summary(res)
```

# **Clean data**

```{r}
## log fold change shrinkage for visualization and ranking
resultsNames(dds)
resLFC <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
resLFC
```

```{r}
## Organizing results by smallest P-value
resOrdered <- res[order(res$pvalue),]
resOrdered

## How many adjusted p-values were less than 0.01?
sum(res$padj < 0.1, na.rm=TRUE)

## Changing p-value to 0.05 
res05 <- results(dds, alpha=0.05)
summary(res05)
## How many adjusted p-values were less htan 0.05?
sum(res05$padj < 0.05, na.rm=TRUE)
```

# **Observe quality of data**

```{r}
plotMA(res, ylim=c(-2,2))
plotMA(resLFC, ylim=c(-2,2))
```

```{r}
#checked 
plotDispEsts(dds, main="Dispersion plot")
```

```{r}
mcols(resLFC)$description
```

```{r}
ntd <- normTransform(dds)
rld <- rlogTransformation(dds)
head(assay(rld))

vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)

meanSdPlot(assay(ntd))
meanSdPlot(assay(vsd))
meanSdPlot(assay(rld))

#Shows the effect of the transformation, i
ddsESF <- estimateSizeFactors(dds)
df1 <- data.frame(log2(counts(ddsESF, normalized=TRUE)[, 1:2] + 1))
df1$transformation <- "log2(x + 1)"
df2 <- data.frame(assay(rld)[, 1:2])
df2$transformation <- "rld"
df3 <- data.frame(assay(vsd)[, 1:2])
df3$transformation <- "vsd"
df <- rbind(df1, df2, df3)
colnames(df)[1:2] <- c("x", "y")
head(df)
table(df$transformation)

options(repr.plot.width=6, repr.plot.height=2.5)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
#rld graph compresses differences for the low count genes > excludes genes with low counts

par(mfrow=c(1,3))
boxplot(log2(assay(ddsESF)+1), las=2, main="log2(x+1)")
boxplot(assay(rld), las=2, main="rld")
boxplot(assay(vsd), las=2, main="vsd")
```

```{r}

select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)["condition"])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
```

```{r}
#
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(samples$condition))])
sampleDists <- as.matrix(dist(t(assay(rld))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "blue", "white"),
          ColSideColors=mycols[samples$condition],
          RowSideColors=mycols[samples$condition],
          margin=c(10, 10), main="Sample Distance Matrix")
```

```{r}
sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, rld$sample, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)
```

```{r}
#PCA Alternative code
#DESeq2::plotPCA(rld, intgroup="condition")
```

```{r}
pcaData <- plotPCA(rld, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
```

```{r}
table(resLFC$padj<0.05)
res <- resLFC[order(resLFC$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
```

# **Data Visualization**

```{r}
# Volcano plot
with(resLFC, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot",ylim=c(0,10), xlim=c(-2,2)))
# Add colored points: red if padj<0.05, orange of log2FC>1, green if both)
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))
```

```{r}
#Assigning the top 20 significant genes 
topVargenes20 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
topVargenes500 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 500)


#Making a heatmap of the top 20 significant genes 
mat <- assay(rld)[topVargenes20, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, annotation_col = anno)


mat <- assay(rld)[topVargenes500, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, scale = "row", show_rownames = F,
         clustering_method = 'average',  annotation_col = anno, cutree_cols = 3)
```

# **Pathway Analysis**

```{r}
#Creating datasets with KEGG gene sets to test
kegg_mouse <- kegg.gsets(species = "mouse", id.type = "kegg")
names(kegg_mouse)
kegg.gs <- kegg_mouse$kg.sets[kegg_mouse$sigmet.idx]
```

```{r}
res$symbol = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
res$entrez = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")
res$name =   mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="GENENAME",
                     keytype="ENSEMBL",
                     multiVals="first")
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
```

```{r}
#Running GAGE

keggres = gage(foldchanges, gsets=kegg.gs, same.dir = T)
names(keggres)

lapply(keggres, head) #Up-regulated and down-regulated pathways
#head(keggres$greater) #Up-regulated pathways
#head(keggres$less) # Down-regulated pathways

# Explore the top 20 up-regulated pathways and KEGG IDs
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% 
  tbl_df() %>% 
  filter(row_number()<=20) %>% 
  .$id %>% 
  as.character()
keggrespathways
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
```

```{r}
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse", new.signature=FALSE)

#plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse"))
```

```{bash}
# Open inflammation-associated pathway .png files
feh ~/Documents/nsbb552/mmu04662.pathview.png #"mmu04662 B cell receptor signaling pathway"
feh ~/Documents/nsbb552/mmu04657.pathview.png #"mmu04657 IL-17 signaling pathway"
feh ~/Documents/nsbb552/mmu04672.pathview.png #"mmu04672 Intestinal immune network for IgA production"
feh ~/Documents/nsbb552/mmu04668.pathview.png #"mmu04668 TNF signaling pathway"  

```

# **Discussion**

##### After processing and cleaning the dataset, there is some clear differential expression between the breast fed (Condition A: Control) and the formula fed (Condition B: NEC) groups. This difference is most clearly observed when comparing samples 2,3,4 and 14,15,17, and 19. The top 20 most differentiated genes didn't have a strong association with inflammation pathways. Aditionally, downstream pathway analyses of p vale\<0.05 were to stringent, so p-value\<0.1 was used. In conlcusion, this dataset does show differntial expression of genes in pro-inflammation pathways. Further pathways analyses with IPA and possible gene enrichment would be useful in pursuing these connections.

```{r}
## Doesn't make biological sense 
# Create background dataset for hypergeometric testing using all genes tested for significance in the results                  
all_genes <- as.character(rownames(res)) 
# Extract significant results 
signif_res <- res[res$padj < 0.07 & !is.na(res$padj), ] 
signif_genes <- as.character(rownames(signif_res)) 

ego <- enrichGO(gene = signif_genes, universe = all_genes, keyType = "ENSEMBL", OrgDb = org.Mm.eg.db, ont = "BP",pAdjustMethod = "BH", qvalueCutoff = 0.07, readable = TRUE) 
# Output results from GO analysis to a table 
cluster_summary <- data.frame(ego) 
#Visualizing 
dotplot(ego, showCategory=50)
#emapplot(ego, showCategory=50)

# To color genes by log2 fold changes 
signif_res_lFC <- signif_res$log2FoldChange 
cnetplot(ego, categorySize="pvalue", showCategory = 5, foldChange= signif_res_lFC, vertex.label.font=6)
```

```{r}
sessionInfo()
```

### Relevant Documentation

1.  rnaseqGene: end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages (<http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html>)

```{r}
browseVignettes("rnaseqGene")
```

2.  DESeq2: Differential gene expression analysis based on the negative binomial distribution (<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>)

```{r}
browseVignettes("DESeq2")
```

3.  Salmon: A tool for quantifying the expression of transcripts using RNA-seq data (<https://combine-lab.github.io/salmon/getting_started/#after-quantification>)

-   

    ###### Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.

4.  apeglm

-   

    ###### Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for

    sequence count data: removing the noise and preserving large differences. Bioinformatics. <https://doi.org/10.1093/bioinformatics/bty895>
